Introduction

This tutorial will focus on analysing updated data of the worldwide Novel Corona virus (Covid-19) pandemic.
There are several data sources available online. We will use the data collected from a range of sources by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and hosted on their GitHub repository. Note that this data is lagging 1 day behind!!

Analyse Data in R

Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.

Install Extra Packages

For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on EcoCloud) to try and avoid having to download the entire tidyverse. Please install the packages if you’re getting error when running the library() command. In addition we will use the plotly package for interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file and highcharter and countrycode packages to plot the data on a world map.
To install these packages, we use the install.packages('package') command, please note that the package name need to be quoted and that we only need to perform it once (if we followed step 3b above, or every time we restart the server if we didn’t), or when we want or need to update the package. Once the package was installed, we can load its functions using the library(package) command. Note that in this case we use the package name without quotes!.

# install required packages - needed only once! (comment with a # after first use)
# install.packages(c("dplyr", "ggplot2", "paletteer", "stringr", "readr","readxl", "highcharter", "countrycode", "plotly"))
# load required packages
library(dplyr)
library(readr)
library(stringr)
library(forcats)
library(readxl)
library(ggplot2)
library(paletteer)
library(plotly)
library(highcharter)
library(countrycode)

More information on installing and using R packages can be found in this tutorial.

Read Data

Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.). Please download the data files from Blackboard and put them in the data folder.

We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data from a file on our computer into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R (which will slightly change the structure of the resulting data frame).
> Note that the path to the files can be either relative or absolute to our current working directory and that we use aforward slash / (Unix/Mac style) and not a backslash \ (Windows style), to make life easier, make use of RStudio’s autocompletion feature._

# read data from the 'data' folder
covid_data <- read_csv("data/csse_country_covid_data_21_03_2020.csv")

Data Exploration

Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column:

#explore the data frame
head(covid_data) # show first 10 rows of the data and typr of variables
## # A tibble: 6 x 5
##   `Country/Region` Date       Confirmed Deaths Recovered
##   <chr>            <date>         <dbl>  <dbl>     <dbl>
## 1 Afghanistan      2020-01-22         0      0         0
## 2 Afghanistan      2020-01-23         0      0         0
## 3 Afghanistan      2020-01-24         0      0         0
## 4 Afghanistan      2020-01-25         0      0         0
## 5 Afghanistan      2020-01-26         0      0         0
## 6 Afghanistan      2020-01-27         0      0         0
str(covid_data) # show data structure
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 9960 obs. of  5 variables:
##  $ Country/Region: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Date          : Date, format: "2020-01-22" "2020-01-23" ...
##  $ Confirmed     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Deaths        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Recovered     : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Country/Region` = col_character(),
##   ..   Date = col_date(format = ""),
##   ..   Confirmed = col_double(),
##   ..   Deaths = col_double(),
##   ..   Recovered = col_double()
##   .. )

Descriptive Statistics

We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continous numerical values.

# summary of variables in my data
summary(covid_data)
##  Country/Region          Date              Confirmed         Deaths       
##  Length:9960        Min.   :2020-01-22   Min.   :    0   Min.   :   0.00  
##  Class :character   1st Qu.:2020-02-05   1st Qu.:    0   1st Qu.:   0.00  
##  Mode  :character   Median :2020-02-20   Median :    0   Median :   0.00  
##                     Mean   :2020-02-20   Mean   :  488   Mean   :  16.83  
##                     3rd Qu.:2020-03-06   3rd Qu.:    3   3rd Qu.:   0.00  
##                     Max.   :2020-03-21   Max.   :81305   Max.   :4825.00  
##    Recovered      
##  Min.   :    0.0  
##  1st Qu.:    0.0  
##  Median :    0.0  
##  Mean   :  179.2  
##  3rd Qu.:    0.0  
##  Max.   :71857.0

We can see that most of our data contains ‘0’ (check the median of Confirmed column). Just to confirm that, let;s plot a histogram of all the confirmed cases

ggplot(covid_data, aes(x=Confirmed)) +
  geom_histogram(fill="lightskyblue") +
  theme_bw(16)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What are the metadata columns that describe our observations?

Country 
Date

The data is evolving over a time-series, to there’s no point treating it as a random population sample.

Time-series plot

Let’s look at confirmed cases data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:

  1. First we group it by Country with group_by()
  2. Then we sort it within each Country by Date (from latest to earliest) with arrange(desc(Date))
  3. We select just the most recent data with slice(1) and remove grouping with ungroup()
  4. Next we arrange it by descending order of confirmed cases and select the top 10
  5. We extract just the Country information with subsetting using the $ sign
  6. We add Australia to the resulting vector of countries with c()
  7. Finally, we subset our original data to contain just the countries from our vector with filter()

Optional step:

  1. We can reorder the countries so they will be ordered in the legend by the number of cases with fct_reorder()

Then we can plot the data as number of cases in the y-axis and date in the x-axis.

# find the 10 most affected countries (to date)
latest_data <- covid_data %>% group_by(`Country/Region`) %>% arrange(desc(Date)) %>% slice(1) %>% ungroup() 
most_affected_countries <- latest_data %>% arrange(desc(Confirmed)) %>% slice(1:10) %>% .$`Country/Region` %>% 
  c(., "Australia")
# subset just the data from the 10 most affected countries and order them from the most affected to the least one
plot_data <- covid_data %>% filter(`Country/Region` %in% most_affected_countries) %>% 
  mutate(Country=fct_reorder(factor(`Country/Region`), Confirmed, .desc = TRUE))
# create a line plot the data
ggplot(plot_data, aes(x=Date, y=Confirmed, colour=Country)) +
  geom_line(size=1) + 
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country") +
  theme_bw(16)

It’s a bit hard to figure out how the pandemic evloves currently because the large numbers in China mask the rest. How can we make it more visible?

# create the plot
plot <- ggplot(plot_data, aes(x=Date, y=Confirmed, colour=Country)) +
  geom_line(size=0.6) + scale_y_log10(labels=scales::comma) +
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country") +
  theme_bw(16)
# show an interactive plot
ggplotly(plot)
## Warning: Transformation introduced infinite values in continuous y-axis

Why did we get a warning message? What can we infer from the graph (exponential increase)?

What happens when we take the log of 0??
We can see a very similar trend for most countries and while the curve is flattening in China, South Korea and slightly in Iran, it is still alarmingly steep in Italy and most European countries and the US.

We can also look at the number of daily new cases as a measure to how rapidly the virus spreads in the population. We will read another dataset collected by the EU of daily new cases (could we calculate it from the data we currently have?)
We need to wrangle this one up as well to make sure that the countries names are consistent (try to open the file in excel file and find the issue). We also calculate the number of total cases using mutate(Total_cases=cumsum(Cases), Total_Deaths=cumsum(Deaths) and create a new column that will hold the 3 character country code (iso3c) to help us map the data later on.

covid_daily_data <- read_excel("data/COVID-19-geographic-disbtribution-worldwide-2020-03-22.xlsx") %>% 
  rename(Country=`Countries and territories`) %>%
  mutate(Country=str_to_title(Country)) %>%
  group_by(Country) %>% arrange(Country, DateRep) %>%
  mutate(Total_cases=cumsum(Cases), Total_Deaths=cumsum(Deaths), 
         GeoId = case_when(GeoId=="UK"~"GB",
                           GeoId=="EL"~"GR",
                           GeoId=="JPG11668"~"UM",
                           TRUE~GeoId),
         iso3c=countrycode(GeoId, origin = 'iso2c', destination = 'iso3c')) %>% 
  ungroup()
## Warning in countrycode(GeoId, origin = "iso2c", destination = "iso3c"): Some values were not matched unambiguously: PYF
## Warning in countrycode(GeoId, origin = "iso2c", destination = "iso3c"): Some values were not matched unambiguously: XK
## Warning in countrycode(GeoId, origin = "iso2c", destination = "iso3c"): Some values were not matched unambiguously: AN

Similar to what we did earlier, we’ll group the data by Country and sort it by Date to select the most current data that we’ll use to select the most affected countries.

# find the 10 most affected countries and order them from the most affected to the least one
most_affected_eu_data <- covid_daily_data %>% group_by(Country) %>% arrange(desc(DateRep)) %>% slice(1) %>% ungroup()  %>% arrange(desc(Total_cases)) %>% slice(1:10) %>% .$Country %>% c(., "Australia")

# subset the daily data and add Australia as well
plot_data <- covid_daily_data %>% filter(Country %in% most_affected_eu_data) %>% 
  mutate(Country=fct_reorder(factor(Country), Total_cases, .desc = TRUE)) %>% 
  filter(DateRep>as.Date("2020-02-01"))
# create the plot
plot <- ggplot(plot_data, aes(x=DateRep, y=Cases, colour=Country)) +
  geom_line(size=0.6) + #scale_y_log10(labels=scales::comma) +
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(x="Date", y="Confirmed Cases") +
  theme_bw(16)
# show an interactive plot
ggplotly(plot)

Questions

  1. Why didn’t we use the vector of countries that we identified earlier from the original dataset?
  2. What do you suggest influence the number of new cases?
1. Because the country names not necceraly match (check how USA appears in both datasets)
2. Any ideas?

Choroplet Map

Let’s visualise now what’s the current status around the globe. We will use just the most current data (from the latest Date). We will create a color scale to represent the number of cases in each country.

# download map data
world_map <- download_map_data("custom/world-palestine-highres")
mapdata <- get_data_from_map(world_map)

# select only most current data
choroplet_data <- covid_daily_data %>%  group_by(Country) %>% arrange(desc(DateRep)) %>%
  slice(1) %>% mutate(log_cases=log(Total_cases))

# # define colors
bins <- c(1, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000)
cols <- as.character(paletteer_c("viridis::inferno", length(bins), direction = -1))
stops <- data.frame(q=1:length(bins)/length(bins), c=cols) %>% list_parse2(.)
# stops <- data.frame(q=bins/bins[length(bins)], c=cols) %>% list_parse2(.)
# stops <- data.frame(name=bins, from=log(bins)/log(bins[length(bins)]), color=cols, stringsAsFactors = FALSE) %>% list_parse2(.)
# stops <- data.frame(q=rev(1-log(bins)/log(bins[length(bins)])), c=cols) %>% list_parse2(.)

# plot map
  highchart(type = "map") %>%
    hc_add_series_map(map = world_map, df = choroplet_data,
                      value = "log_cases", joinBy = c("iso-a3", "iso3c")) %>%
    hc_colorAxis(stops  = stops) %>%
    hc_tooltip(useHTML=TRUE,headerFormat='',
               pointFormat = paste0('{point.Country} confirmed cases : {point.Total_cases}<br/>Deaths: {point.Total_Deaths}')) %>%
    hc_mapNavigation(enabled = TRUE) %>%
  hc_title(text = "Number of confirmed Covid 19 cases by Country",
           margin = 40, align = "left",
           style = list(color = "#2b908f", useHTML = TRUE)) %>%
  hc_subtitle(text = glue::glue("Data last updated on {format(max(choroplet_data$DateRep), '%d/%m/%Y')}"),
              align = "left",
              style = list(color = "#2b908f", fontWeight = "bold")) %>%
  hc_exporting(enabled = TRUE)

What else can we plot or investigate?

Additional Resources

Contact

Please contact me at i.bar@griffith.edu.au for any questions or comments.